HEAD ======= >>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
covidHubUtils R packageThe COVID-19 Forecast Hub is a central repository for modeler-contributed short-term forecasts of COVID-19 trends in the US. The US Centers for Disease Control and Prevention (CDC) displays forecasts from the Forecast Hub on its modeling and forecasting webpages.
The Forecast Hub has been curating forecast data since April 2020, and has collected over 150 million unique rows of forecast data. These data are stored in our public GitHub repository and in the Zoltar forecast archive.
The goal of the covidHubUtils R package is to create a set of basic utility functions for accessing, visualizing, and scoring forecasts from the COVID-19 Forecast Hub.
The covidHubUtils package relies on a small number of packages, including many from the tidyverse and, importantly, the zoltr package that is used to access the Zoltar API for downloading forecasts. Please install zoltr from GitHub, as this development version often has important features not yet on the CRAN version:
devtools::install_github("reichlab/zoltr")
The covidHubUtils package currently is only available on GitHub, and it may be installed using the devtools package:
devtools::install_github("reichlab/covidHubUtils")
One of the key features of the COVID-19 Forecast Hub is making millions of rows of forecast data available in a standard format for easy analysis and visualization. The covidHubUtils package allows for users to download data into an R session either by reading files from a local clone of the COVID-19 Forecast Hub repository or by downloading data from the Zoltar API. (While Zoltar currently requires a user account to download data via the API, we have created a specific user account for covidHubUtils so that a user account is not needed.)
We have identified two central use cases for downloading data:
load_latest_forecasts() function.load_forecasts() function.Below are some examples of reading in data. We start by loading the covidHubUtils package and the tidyverse.
library(covidHubUtils)
library(tidyverse)
theme_set(theme_bw())
The following code loads 1 through 4 week ahead incident case forecasts from Zoltar for the COVIDhub-ensemble model. Note that the forecast_date_window_size option specifies the range of days to look at for a forecast. So the query below is looking for the most recent forecast from COVIDhub-ensemble in the span of 2020-12-01 through 2020-12-07. We have decided to set forecast_date_window_size to 6 because a window size of 0 still includes the last forecast date you provided. So that last date minus 6 days will cover a week, starting from 6 days before the last date and going up through the last date.
We also load the hospitalization data and incident deaths here by changing the target to the desired quantity and load that dataset to further work on. The target variable can be changed to what is desired by changing the appropriate lines of code.
The verbose parameter has been explicitly set to false here to prevent the multiple lines of output of the inner workings of the load function. It is set to false by default and all the information is printed to the console unless specified otherwise.
# Load forecasts that were submitted in a time window from zoltar
inc_case_targets <- paste(1:4, "wk ahead inc case")
forecasts <- load_latest_forecasts(models = "COVIDhub-ensemble",
last_forecast_date = "2020-12-07",
forecast_date_window_size = 6,
locations = "US",
types = c("point","quantile"),
targets = inc_case_targets,
<<<<<<< HEAD
source = "zoltar",
verbose = FALSE,
as_of=NULL,
hub = c("US"))
hosp_data <- paste(0:130, "day ahead inc hosp")
forecasts_hosp <- load_latest_forecasts(models = "COVIDhub-ensemble",
last_forecast_date = "2020-12-07",
forecast_date_window_size = 6,
locations = "US",
types = c("point","quantile"),
targets = hosp_data,
source = "zoltar",
verbose = FALSE,
as_of=NULL,
hub = c("US"))
inc_case_targets_death <- paste(1:4, "wk ahead inc death")
forecasts_death <- load_latest_forecasts(models = "COVIDhub-ensemble",
last_forecast_date = "2020-12-07",
forecast_date_window_size = 6,
locations = "US",
types = c("point","quantile"),
targets = inc_case_targets_death,
source = "zoltar",
verbose = FALSE,
as_of=NULL,
hub = c("US"))
And here is the data that is returned from this query, note that in addition to the essential forecast data itself, some additional columns with information about the locations are returned.
This data can then be plotted directly with a call to plot_forecast(). Examples for the inc
p <- plot_forecast(forecast_data = forecasts,
truth_source = "JHU",
target_variable = "inc case",
intervals = c(.5, .8, .95))
p_hosp <- plot_forecast(forecast_data = forecasts_hosp,
truth_source = "HealthData",
target_variable = "inc hosp",
intervals = c(.5, .8, .95))
p_death <- plot_forecast(forecast_data = forecasts_death,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95))
## polling for status change. job_url=https://zoltardata.com/api/job/49991/
## QUEUED
## QUEUED
## SUCCESS
And here is the data that is returned from this query, note that in addition to the essential forecast data itself, some additional columns with information about the locations are returned.
This data can then be plotted directly with a call to plot_forecasts().
p <- plot_forecasts(forecast_data = forecasts,
truth_source = "JHU",
target_variable = "inc case",
intervals = c(.5, .8, .95))
Note that many additional arguments may need to be passed to this function to have a reasonable plot returned if your dataset has multiple locations, forecast dates or models.
Additionally the resulting plot object can be modified. For example, if you want to change the way the x-axis handles dates, you could add a custom ggplot scale_x_date() specification:
p + scale_x_date(name=NULL, date_breaks = "1 month", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"), axis.text.x = element_text(vjust = 7, hjust = -0.2))
<<<<<<< HEAD
Additionally, plot_forecast() can handle plotting multiple models or locations or forecast dates at the same time as the following examples show.
Additionally, plot_forecasts() can handle plotting multiple models or locations or forecast dates at the same time as the following examples show.
# Load forecasts that were submitted in a time window from zoltar
# Load forecasts that were submitted in a time window from zoltar
inc_case_targets <- paste(1:4, "wk ahead inc case")
forecasts_ECDC <- load_latest_forecasts(models=c("ILM-EKF"),
hub = c("ECDC","US"), last_forecast_date = "2021-03-08",
forecast_date_window_size = 0,
locations = c("GB"),
targets = paste(1:4, "wk ahead inc death"),
source = "zoltar")
## polling for status change. job_url=https://zoltardata.com/api/job/51989/
## QUEUED
## SUCCESS
truth <- load_truth("JHU",hub = c("ECDC","US"),
target_variable = "inc death", locations = "GB")
scores <- score_forecasts(forecasts, truth)
datatable(forecasts_ECDC,
extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
))
#p_ECDC <- plot_forecasts(forecast_data = forecasts_ECDC,
# truth_source = "JHU",
# target_variable = "inc case",
# intervals = c(.5, .8, .95)
#)
The following code looks at three models’ forecasts of incident deaths at one time point for one location. Note the use of the fill_by_model option which allows colors to vary by model and the facet command which is passed to ggplot.
fdat <- load_latest_forecasts(models = c("Karlen-pypm", "UMass-MechBayes", "CU-select"),
last_forecast_date = "2020-12-14",
source = "zoltar",
forecast_date_window_size = 6,
locations = "US",
types = c("quantile", "point"),
targets = paste(1:4, "wk ahead inc death"))
<<<<<<< HEAD
## polling for status change. job_url=https://zoltardata.com/api/job/51990/
=======
## polling for status change. job_url=https://zoltardata.com/api/job/49992/
## QUEUED
## QUEUED
>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
## QUEUED
## SUCCESS
p <- plot_forecasts(fdat,
target_variable = "inc death",
truth_source = "JHU",
intervals = c(.5, .95),
facet = .~model,
fill_by_model = TRUE,
plot=FALSE)
p +
scale_x_date(name=NULL, date_breaks = "1 months", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2))
<<<<<<< HEAD

=======

>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
The following code looks at three models’ forecasts of incident deaths at one time point for multiple locations. Note the use of the facet_scales option which is passed to ggplot and allows the y-axes to be on different scales.
fdat <- load_latest_forecasts(models = c("Karlen-pypm", "UMass-MechBayes", "CU-select"),
last_forecast_date = "2020-12-14",
source = "zoltar",
forecast_date_window_size = 6,
locations = c("19", "48", "46"),
types = c("quantile", "point"),
targets = paste(1:4, "wk ahead inc death"))
<<<<<<< HEAD
## polling for status change. job_url=https://zoltardata.com/api/job/51992/
=======
## polling for status change. job_url=https://zoltardata.com/api/job/49993/
>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
## QUEUED
## QUEUED
## SUCCESS
p <- plot_forecasts(fdat,
target_variable = "inc death",
intervals = c(.5, .95),
truth_source = "JHU",
facet = location~model,
facet_scales = "free_y",
fill_by_model = TRUE,
plot=FALSE)
p +
scale_x_date(name=NULL, date_breaks = "1 months", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2))
<<<<<<< HEAD

=======

>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
The following code looks at three models’ forecasts of incident deaths at one time point for multiple locations. Note the use of the facet_scales option which is passed to ggplot and allows the y-axes to be on different scales.
fdat <- load_forecasts(models = c("Karlen-pypm", "UMass-MechBayes"),
forecast_dates = seq.Date(as.Date("2020-09-06"), as.Date("2020-12-13"), by = "28 days"),
locations = "US",
types = c("quantile", "point"),
targets = paste(1:4, "wk ahead inc death"))
<<<<<<< HEAD
## polling for status change. job_url=https://zoltardata.com/api/job/51993/
=======
## polling for status change. job_url=https://zoltardata.com/api/job/49994/
## QUEUED
>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
## QUEUED
## QUEUED
## SUCCESS
p <- plot_forecasts(fdat,
target_variable = "inc death",
truth_source = "JHU",
intervals = c(.5, .95),
facet = .~model,
fill_by_model = TRUE,
plot = FALSE)
p + scale_x_date(name=NULL, date_breaks = "1 months", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2))
<<<<<<< HEAD

=======

>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
By default plot_forecasts() uses JHU CSSE data as the “Observed Data” in the above plots. However, users can specify custom “ground truth” data that either they provide themselves or that is loaded in from the package.
Here is an example of a call to plot_forecasts() that simply specifies an alternate truth source, which must be one of “JHU”, “USAFacts”, or “NYTimes”.
plot_forecasts(forecast_data = forecasts,
target_variable = "inc case",
locations = "US",
truth_source = "USAFacts",
intervals = c(.5, .8, .95))
<<<<<<< HEAD
Alternatively, truth data can be loaded in from one of those sources independently and stored in your active R session and passed to the plot_forecast() function.
truth_data <- load_truth(truth_source = "JHU",
target_variable = "inc case",
locations = "US")
Truth data comes in the following tabular format.
And can be used in conjunction with a call to plot_forecast
plot_forecast(forecast_data = forecasts,
=======

Alternatively, truth data can be loaded in from one of those sources independently and stored in your active R session and passed to the plot_forecasts() function.
truth_data <- load_truth(truth_source = "USAFacts",
target_variable = "inc case",
locations = "US")
Truth data comes in the following tabular format.
And can be used in conjunction with a call to plot_forecasts
plot_forecasts(forecast_data = forecasts,
>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
target_variable = "inc case",
truth_data = truth_data,
truth_source = "USAFacts",
intervals = c(.5, .8, .95))
<<<<<<< HEAD

=======

>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
In addition to downloading forecasts, covidHubUtils has the capability to evaluate the forecasts based on metrics including the prediction interval coverage at any provided quantiles, the absolute error based on a median estimate, the weighted interval score (WIS) of the forecast, and a component-wise breakdown of WIS into sharpness, overprediction and underprediction.
The inputs to the scored_forecasts() include a dataframe created using load_forecasts() and a dataframe created using truth_data().
The following code creates a dataframe of scored data based on the forecasts and truth data loaded earlier in this vignette in long format.
<<<<<<< HEADforecasts_multiple <- load_latest_forecasts(models = c("COVIDhub-baseline", "COVIDhub-ensemble"),
last_forecast_date = "2020-12-07",
forecast_date_window_size = 6,
locations = "US",
types = c("point","quantile"),
targets = inc_case_targets,
source = "zoltar",
verbose = FALSE,
as_of=NULL,
hub = c("US"))
scores <- score_forecasts(forecasts = forecasts_multiple,
return_format = "wide",
truth = truth_data)
truth <- load_truth("JHU",hub = c("ECDC","US"), target_variable = "inc death", locations = "GB")
scores_ECDC <- score_forecasts(forecasts = forecasts_ECDC,
return_format = "wide",
truth = truth)
=======
scores <- score_forecasts(forecasts = forecasts,
return_format = "long",
truth = truth_data)
>>>>>>> aded6f92ead8ac02456d2f3664fd301a8134272e
The scores calculated using this functionality can be used to compare the accuracy and precision of forecasts across models, locations, horizons, and submission weeks.
You can access the evaluation paper to read an in depth explanation about the methodologies and you can also visit the evaluation reports page.
Scores can be visualized through number of aggregations and stratification and the following code should be a good starting point to get to visualizing the scores for your forecasts:
calibration_df <- scores %>%
group_by(model) %>%
summarise(mean_PI50_recent = round(sum(coverage_50, na.rm = TRUE) / n(),2),
mean_PI95_recent = round(sum(coverage_95, na.rm = TRUE) / n(),2)) %>%
ungroup()
datatable(
calibration_df,
options = list(
dom = 't',
scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
))